Community structure exists in many real-world networks and has been reported to be related to 
several functional properties of the networks. The conventional approach has been to partition nodes into 
communities, while some recent studies have started partitioning links instead of nodes to find overlapping 
communities of nodes efficiently. We extended the map equation method, which was originally 
developed for node communities, to find link communities in networks. This method is tested on 
various kinds of networks and compared with the metadata of the networks, and the results show 
that our method can identify the overlapping role of nodes effectively. The advantage of this method 
is that the node community scheme and link community scheme can be compared quantitatively by 
measuring the unknown information left in the networks besides the community structure. It can 
be used to decide quantitatively whether or not the link community scheme should be used instead 
of the node community scheme. Furthermore, this method can be easily extended to directed 
and weighted networks since it is based on the random walk. 

PACS numbers: 89.70.-a, 05.40.Fb, 02.10.Ox, 02.50.-r 



I. INTRODUCTION 

Complex networks have been widely used to represent 
systems composed of connected objects, and many 
system-wide behaviors that emerge from the pattern of 
connections have been successfully explained with the 
help of this simple model [1, 2]. The rising popularity 
of complex networks across several disciplines, including 
statistical physics, computer science, computational 
biology, and sociology, rests on many reasons; a major 
one is that many large-scale networks have become 
available owing to advances in information technology. 
The advantage of large-scale networks is that many 
meaningful statistical properties, for example, the degree 
distribution, clustering coefficient, assortativity, and 
motif profiles, can be studied accurately. However, the 
large size of the networks also brings disadvantages. When 
a network is small, it is easy to visualize, and its 
organization structure can be perceived intuitively. When 
the size becomes large, by contrast, a comprehensive 
understanding of the structure can no longer be gained 
directly, and quantitative analyses are required. 

Community detection is one of the efforts devoted to 
the quantitative analysis of the organization structure. In 
many real-world networks, the nodes are connected neither 
regularly nor completely randomly. Instead, some 
nodes are densely interconnected to form communities, 
while these communities are only loosely connected to 
one another. This kind of network structure, which is usually 
referred to as the community structure, is closely related to 
many dynamic processes on the network [31 S]. Therefore, 
detecting the community structure has become one 
of the most important problems in network research, 
and many methods have been proposed to solve the problem 
efficiently [5]. The map equation method [6], also 
known as the Infomap method, has been considered one of 
the best performing methods [3 [7]. This method is based 
on the Minimum Description Length (MDL) principle [8], 
according to which any regularity in the data can be used 
to compress the length of the data. Therefore, by considering 
the community structure as the regularity of the 
network and the path of the random walk on the network 
as the data to compress, the community structure can be 
detected during the compression of the path description. 
This is the main idea of the map equation method, and 
it is explained in detail in Sec. II. 

While most previous research on community detection 
has focused on communities of nodes, some recent 
studies have shifted attention to communities 
of links [21 [9] and even cliques [10]. From the 
theoretical point of view, a community of links can 
be more intuitive than a community of nodes in some 
real-world networks, because a link is more likely to 
have a unique identity while a node tends to have multiple 
identities. For example, most individuals in 
society belong to multiple communities such as families, 
friends, and co-workers, while the link between a pair of 
individuals usually exists for one dominant reason. From 
the practical point of view, overlapping communities of 
nodes, which is another attractive topic of community 
detection [TTHH], can be detected as a byproduct, because 
the links connected to a single node can belong 
to different link communities and consequently the node 
can be assigned to multiple communities of links. However, 
an exclusive partitioning of links is not always accurate; 
this problem is discussed in Sec. IV. The clique community 
goes further in this direction, since a link is a 
clique of two nodes. 

In this paper, we propose a modified version of the map 
equation method, which can be used to detect link communities 
under the MDL principle. In Sec. II, a brief review 
of the original map equation is presented, and the modified 
version of the map equation method is introduced in 
Sec. III. The best way to check the performance 
of a community detection method is to compare 
the detected communities with the available metadata. We 
apply our method to several networks with rich metadata, 
and the results are quantitatively compared 
with those of other community detection methods in Sec. V. 
An important advantage of our method is that the results 
of link community and node community can be quantitatively 
compared. In Sec. VI, a model network is proposed 
to verify this property, and the comparison is carried out 
on some real-world networks to show which partitioning 
scheme, link community or node community, depicts 
the organization structure of these networks more 
properly. 

For simplicity of derivation, only binary and 
undirected networks are considered in this paper. The 
extension to weighted and/or directed networks is briefly 
discussed at the end of Sec. III. 



II. THE MAP EQUATION FOR NODE 
COMMUNITY 



The most general definition of a community is 
a group of nodes that are densely interconnected. 
Meanwhile, from the viewpoint of information 
propagation, another definition can be proposed: A 
community is a group of nodes in which information 
is more likely to be trapped than to spread out. 
Considering that the random walk is the most fundamental 
model of information propagation, community structure 
can be detected by finding the local structures that trap 
the random walker. Some recent studies [Ul [TB] have 
shown that the modularity [17], which is a quality function 
used to find communities as groups of densely 
connected nodes, can also be interpreted in terms of the random 
walk, and that some disadvantages of the modularity can be 
easily resolved in this alternative approach. 

The map equation method [6] detects communities by 
the information-propagation-based definition, under the 
philosophy of the Minimum Description Length (MDL) 
principle [8]. The basic idea of the MDL principle is that any 
regularity in the data can be used to compress the length 
of the data. If we can find a way to encode the path of 
random walk on the network and consider the commu- 
nity structure as the regularity in the network, commu- 
nity structure can be detected by finding the partition 
that gives the minimum description length of the path. 
In the map equation method, the encoding rule for the 
path description can be described as follows. 

To uniquely describe the path of a random walk on the 
network, the simplest way would be to assign a distinguishable 
code to each node in order to avoid ambiguity. 
The description length becomes shorter 
when the more frequently visited nodes are given shorter 
codes and the less frequently visited nodes are given relatively 
longer codes, which is the method known as Huffman coding [18]. 
However, assigning a unique code to each node 
in the network could be very inefficient if the network 
is large and the random walker 
is frequently trapped in a small area, namely a community of 
nodes. A better strategy is to divide the nodes 
into communities and use a codebook of two levels: 
The first level code describes the community that a 
node belongs to, and the second level code distinguishes 
a specific node from other nodes in the same community. 
In this strategy, a community (first level) code 
is recorded in the path description when and only 
when the random walker enters a new community from 
another community, and the random walk taking 
place within a community can be uniquely described 
by recording only the second level codes. Additionally, an 
exit code is assigned to each community, and it 
is recorded when the random walker exits the 
community, so that the first level codes and the second 
level codes can be distinguished. The cost of using the 
two-level codes is fully compensated if the community 
structure is significant and well detected, 
because in this case the second level codes become 
much shorter and the first level codes of communities 
are not used frequently, which reduces the 
total length of the path description. Therefore, the best 
partition of the network is the one that minimizes 
the average description length of the path of the 
random walk under the coding strategy described above. 
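
The Huffman coding step mentioned in this paragraph can be illustrated with a short sketch. It is not part of the method itself; the function name `huffman_code_lengths` and the visit frequencies are hypothetical and chosen only to show that frequently visited nodes receive shorter codes.

```python
import heapq
from itertools import count


def huffman_code_lengths(frequencies):
    """Return {symbol: code length in bits} for a Huffman code.

    frequencies: dict mapping each symbol (here, a node) to its
    visit frequency. Illustrative helper, not taken from the paper.
    """
    if len(frequencies) == 1:
        return {symbol: 1 for symbol in frequencies}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(weight, next(tiebreak), {symbol: 0})
            for symbol, weight in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical visit frequencies of four nodes.
visit_frequency = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(visit_frequency)
average_length = sum(visit_frequency[s] * lengths[s] for s in lengths)
```

For these dyadic frequencies the average code length coincides with the Shannon entropy of the visit distribution, which is the bound that the map equation uses in place of an explicit code.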

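Once the community partition $\mathsf{M}$ is decided, the probability 
of each code being used can be easily calculated, 
and the map equation $L_{\mathrm{nodecom}}(\mathsf{M})$, which is defined as 
the theoretical minimum of the average description length, 
is given by Shannon's source coding theorem [19] 
as 

$$L_{\mathrm{nodecom}}(\mathsf{M}) = q_{\curvearrowright} H(Q) + \sum_{i=1}^{C} p_{\circlearrowright}^{i} H(P^{i}), \tag{1}$$

where $i$ is the index of community, $\alpha$ is the index of node, 
and $C$ is the number of communities; $q_{\curvearrowright} = \sum_{i=1}^{C} q_{i\curvearrowright}$ 
is the total probability of using the first level codebook, 
where $q_{i\curvearrowright}$ is the probability of using the first level code 
for community $i$; $p_{\circlearrowright}^{i} = q_{i\curvearrowright} + \sum_{\alpha \in i} p_{\alpha}$ is the probability 
of using the second level codebook and the exit code for 
community $i$; and $p_{\alpha}$ is the probability of node $\alpha$ being 
visited, which is equal to the probability of using the 
second level code for node $\alpha$. $H(Q)$ is the average description 
length contributed by the first level codebook: 

$$H(Q) = -\sum_{i=1}^{C} \frac{q_{i\curvearrowright}}{q_{\curvearrowright}} \log \frac{q_{i\curvearrowright}}{q_{\curvearrowright}}, \tag{2}$$

while $H(P^{i})$ is the description length contributed by the 
second level codebook for community $i$: 

$$H(P^{i}) = -\frac{q_{i\curvearrowright}}{p_{\circlearrowright}^{i}} \log \frac{q_{i\curvearrowright}}{p_{\circlearrowright}^{i}} - \sum_{\alpha} \frac{p_{\alpha}^{i}}{p_{\circlearrowright}^{i}} \log \frac{p_{\alpha}^{i}}{p_{\circlearrowright}^{i}}, \tag{3}$$

where $p_{\alpha}^{i}$ is equal to $p_{\alpha}$ when node $\alpha$ belongs to 
community $i$, otherwise zero. The probability $q_{i\curvearrowright}$ is included 
in Eq. (3) to represent the contribution of exit codes for 
community $i$, and it can be computed from the following 
equation once the community structure $\mathsf{M}$ is given: 

$$q_{i\curvearrowright} = \sum_{\alpha \in i} \sum_{\beta \notin i} p_{\alpha} \frac{A_{\alpha\beta}}{k_{\alpha}}, \tag{4}$$

where $A_{\alpha\beta}$ is the element of the adjacency matrix, and it 
equals one if there is a link between nodes $\alpha$ and $\beta$, 
otherwise zero; $k_{\alpha} = \sum_{\beta} A_{\alpha\beta}$ is the degree of node $\alpha$. The 
description length is measured in bits if the logarithm is 
taken with base 2 in the equations above. 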
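
The two-level description length defined in this section can be evaluated with a few lines of code. The following is a minimal sketch rather than the authors' implementation: it assumes an undirected, unweighted network without self-loops or repeated links, uses the stationary visit probability $p_{\alpha} = k_{\alpha}/2M$ (with $M$ the number of links), and takes the exit probability of a community to be the total flow over links that leave it. The function name `node_map_equation` and the example network are hypothetical.

```python
from collections import defaultdict
from math import log2


def node_map_equation(edges, module_of):
    """Average description length (bits per step) of a node partition.

    edges: list of undirected, unweighted links (a, b)
    module_of: dict mapping each node to its community label
    """
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    two_m = sum(degree.values())
    p_node = {node: k / two_m for node, k in degree.items()}

    inside = defaultdict(float)     # sum of p_alpha over nodes of community i
    for node, p in p_node.items():
        inside[module_of[node]] += p
    exit_prob = defaultdict(float)  # probability of leaving community i
    for a, b in edges:
        if module_of[a] != module_of[b]:
            exit_prob[module_of[a]] += 1 / two_m  # p_a * A_ab / k_a
            exit_prob[module_of[b]] += 1 / two_m
    q_total = sum(exit_prob.values())
    p_circle = {i: exit_prob[i] + inside[i] for i in inside}

    length = 0.0
    for i in inside:
        q_i = exit_prob[i]
        if q_i > 0:
            length -= q_i * log2(q_i / q_total)      # first level codebook
            length -= q_i * log2(q_i / p_circle[i])  # exit code of community i
    for node, p in p_node.items():
        length -= p * log2(p / p_circle[module_of[node]])  # second level codes
    return length


# Hypothetical example: two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
one_community = node_map_equation(edges, {n: "A" for n in range(6)})
two_communities = node_map_equation(
    edges, {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"})
```

In this toy network the two-triangle partition gives a shorter description than the single-community partition, which is the behavior the method relies on when the community structure is significant.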

The community structure can be detected by finding 
the partition of nodes that minimizes the map equation 
$L_{\mathrm{nodecom}}(\mathsf{M})$ in Eq. (1), just like other community detection 
methods based on the maximization (or minimization) 
of a quality function. For example, many algorithms 
developed to maximize the modularity [17] can be 
directly used to minimize the map equation by replacing 
only the definition of the quality function in the algorithms. 

FIG. 1: (Color online) The encoding rule for the random walk 
path description in our method. (a) The links are divided into 
two communities: the red (dark gray) one and the blue (light 
gray) one, and the underlined first-level codes are assigned to 
each link community. The second-level codes are assigned to 
the nodes, and the node in the center is given two second-level 
codes because it belongs to both communities. (b) An 
example of the random walk path is depicted on the left side, 
and the description of the path is given on the right side. The 
underlined first-level code is recorded only when the random 
walker is moving across the communities, and it is omitted 
when the random walker is moving within the community. 



III. THE MAP EQUATION FOR LINK 
COMMUNITY 

Although various kinds of methods [5] have been developed 
to find communities, most of them are limited 
to communities of nodes. Some recent studies [7|[n], 
in which the link community is studied instead of the 
node community, showed that if the focus is shifted 
from the nodes to the links, a better description of the 
community structure can be found. In many real-world 
networks, a node can belong to several communities at 
the same time, and this fact makes the node community 
scheme fail to describe the organization structure of the 
system properly. For example, a person can belong to 
several social groups at the same time, and an interdisciplinary 
research project can belong to several scientific fields. 
Meanwhile, a link between a pair of nodes usually exists 
for one dominant reason, so overlaps of communities 
over links are less likely than 
overlaps of communities over nodes. The immediate 
advantage of the link community is that it can be used 
to detect overlapping communities of nodes, which is another 
active field of community identification [11, 12]. 
Although a link belongs to only one community 
when the links are partitioned into communities, a node 
can belong to multiple communities because the links 
connected to the node can belong to different communities 
(i.e., the link communities overlap at 
the nodes). A similar discussion can be applied to 
cliques [9], which are subnetworks of fully connected 
nodes, and the link community can be considered as a 
special case of the clique community since a link is a clique 
composed of two nodes. 
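
How a hard partition of links induces overlapping node communities can be shown with a small sketch. The function name `node_memberships` and the example network (two triangles sharing node `c`) are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def node_memberships(link_community):
    """Map each node to the set of link communities its links belong to.

    link_community: dict {(a, b): community label}, one entry per link.
    """
    memberships = defaultdict(set)
    for (a, b), community in link_community.items():
        memberships[a].add(community)
        memberships[b].add(community)
    return dict(memberships)


# Each link has exactly one community, yet node "c" ends up in both.
link_community = {
    ("a", "b"): 0, ("b", "c"): 0, ("a", "c"): 0,
    ("c", "d"): 1, ("d", "e"): 1, ("c", "e"): 1,
}
memberships = node_memberships(link_community)
overlapping = sorted(n for n, c in memberships.items() if len(c) > 1)
```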

In this section, we propose a modified version of the 
map equation that can be used to find communities 
of links. Since the original map equation can be applied 
only to node communities, the encoding rule for the path 
of the random walk needs to be modified. As illustrated in 
Figure 1, the first step of this modification is to let the 
partition $\mathsf{M}$ describe link communities instead of node 
communities. The links are partitioned into communities, 
and a first level code is assigned to each link community. 
Meanwhile, the second level codes are still assigned 
to the nodes. The advantage of this encoding rule will 
be discussed later in Sec. VI. Since some nodes can belong 
to multiple communities in this case, each of these 
overlapping nodes is given multiple second level 
codes, as many as the number of communities the node 
belongs to. Once the first- and second-level codes are assigned 
according to the community structure we assume, 
the path description is given as follows: (i) at each step, the random 
walker moves from the source node to the target 
node, which means that the random walker moves over the 
link that connects the source and target nodes; 
(ii) if the link traversed at the current step belongs to a different 
community from the link traversed at the previous step, 
the first level code for the new community is recorded 
before recording the second level 
code for the target node; (iii) if the links of the current 
step and the previous step belong to the same community, 
the first level code is omitted and only the 
second level code for the target node is recorded; (iv) 
additionally, an exit code is inserted before each 
first level code in order to distinguish the first level codes 
from the second level codes. 
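
Rules (i) to (iv) can be sketched as a small encoder that turns a path into a sequence of symbolic tokens instead of actual bit strings. This is only an illustration under stated assumptions: the function name `encode_path` and the token names are hypothetical, and no exit code is emitted before the very first community code because there is no community to leave at that point.

```python
def encode_path(path, link_community):
    """Describe a walk as a list of symbolic codes.

    path: list of visited nodes, in order
    link_community: dict {frozenset({a, b}): community label}
    """
    tokens = []
    previous = None  # community of the link traversed at the previous step
    for source, target in zip(path, path[1:]):
        community = link_community[frozenset((source, target))]
        if community != previous:
            if previous is not None:
                tokens.append(("exit", previous))    # rule (iv)
            tokens.append(("community", community))  # rule (ii)
        # Rules (ii) and (iii): the target node is always recorded with the
        # second level code it has in the community of the traversed link.
        tokens.append(("node", community, target))
        previous = community
    return tokens


# Hypothetical network: two triangles of links sharing node "c".
link_community = {
    frozenset(link): c for link, c in [
        (("a", "b"), 0), (("b", "c"), 0), (("a", "c"), 0),
        (("c", "d"), 1), (("d", "e"), 1), (("c", "e"), 1),
    ]
}
tokens = encode_path(["a", "b", "c", "d", "c", "a"], link_community)
```

The overlapping node `c` appears with two different second level codes, `("node", 0, "c")` and `("node", 1, "c")`, depending on the community of the link used to reach it.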

The nodes that belong to multiple communities have 
multiple second level codes and this redundancy is likely 
to increase the length of the path description. However, 
if the link community is more significant than the node 
community (i.e., many nodes belong to multiple commu- 
nities), the redundancy can be compensated by reducing 
the frequency of using first level codes, especially when 
the random walker visits an overlapping node and moves 
back to the previous community. 

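Once the encoding rule is given as above, we can obtain 
the map equation for link community if we know 
the probability of using each code, and these probabilities 
can be easily computed with the help of 
LinkRank [16]. The LinkRank $r_{\alpha\beta}$, which is the probability 
of the link $\alpha \to \beta$ being visited by the random walker in 
the stationary state, is a constant equal to $1/2M$ 
in undirected binary networks, where $M$ is the number 
of links in the network. We use $r_{\alpha\beta}^{i}$ to represent the 
community partition $\mathsf{M}$: $r_{\alpha\beta}^{i}$ is equal to $r_{\alpha\beta}$ if the link 
between nodes $\alpha$ and $\beta$ belongs to community $i$, otherwise 
zero. Given the probability of visiting each link, the 
probability of using the second level code for node $\alpha$ in 
community $i$ is 

$$p_{\alpha}^{i} = \sum_{\beta} r_{\beta\alpha}^{i}, \tag{5}$$

and the probability of using the first level code for community 
$i$ is $q_{i\curvearrowright} = \sum_{\alpha} q_{\alpha\curvearrowright}^{i}$, where 

$$q_{\alpha\curvearrowright}^{i} = p_{\alpha}^{i} \left( 1 - \frac{\sum_{\beta} r_{\alpha\beta}^{i}}{p_{\alpha}} \right) \tag{6}$$

is the probability that the first level code is used after 
visiting node $\alpha$. Here $p_{\alpha}$ is the probability of visiting 
node $\alpha$, and it satisfies $p_{\alpha} = \sum_{i} p_{\alpha}^{i} = k_{\alpha}/2M$, where $k_{\alpha}$ 
is the degree of node $\alpha$. 

Finally, the map equation for the link community is 
given as 

$$L_{\mathrm{linkcom}}(\mathsf{M}) = q_{\curvearrowright} H(Q) + \sum_{i=1}^{C} p_{\circlearrowright}^{i} H(P^{i}), \tag{7}$$

where $q_{\curvearrowright} = \sum_{i} \sum_{\alpha} q_{\alpha\curvearrowright}^{i}$ is the total probability of using 
first level codes, and $p_{\circlearrowright}^{i} = q_{i\curvearrowright} + \sum_{\alpha} p_{\alpha}^{i}$ is the total probability 
of using second level codes and the exit codes. 
$H(Q)$ is the contribution of first level codes to the average 
description length, and it can be computed by 

$$H(Q) = -\sum_{i=1}^{C} \frac{q_{i\curvearrowright}}{q_{\curvearrowright}} \log \frac{q_{i\curvearrowright}}{q_{\curvearrowright}}. \tag{8}$$

Similarly, $H(P^{i})$ is the contribution of second level codes 
in community $i$ to the average description length, and it 
can be computed from the following equation: 

$$H(P^{i}) = -\frac{q_{i\curvearrowright}}{p_{\circlearrowright}^{i}} \log \frac{q_{i\curvearrowright}}{p_{\circlearrowright}^{i}} - \sum_{\alpha} \frac{p_{\alpha}^{i}}{p_{\circlearrowright}^{i}} \log \frac{p_{\alpha}^{i}}{p_{\circlearrowright}^{i}}. \tag{9}$$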
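
The description length of a given link partition can be evaluated with the following minimal sketch. It is written under explicit assumptions and is not the authors' code: the network is undirected and unweighted, so every direction of every link carries LinkRank $1/2M$; a node is charged a second level code in community $i$ with probability (number of its links in $i$)$/2M$; and a community code is charged whenever the walker arrives through a link of community $i$ and leaves through a link of another community. The function name `link_map_equation` and the example network are hypothetical.

```python
from collections import defaultdict
from math import log2


def link_map_equation(link_community):
    """Average description length (bits per step) of a link partition.

    link_community: dict {(a, b): community label} with one entry per
    undirected, unweighted link (no self-loops or repeated links).
    """
    count = defaultdict(int)   # count[(i, node)]: links of community i at node
    degree = defaultdict(int)
    for (a, b), i in link_community.items():
        for node in (a, b):
            count[(i, node)] += 1
            degree[node] += 1
    two_m = 2 * len(link_community)

    inside = defaultdict(float)     # total second level code usage in i
    exit_prob = defaultdict(float)  # first level / exit code usage for i
    for (i, node), c in count.items():
        k = degree[node]
        inside[i] += c / two_m
        # Arrive through community i (c / 2M), then leave through one of
        # the (k - c) links of the node that belong to other communities.
        exit_prob[i] += c * (k - c) / (two_m * k)
    q_total = sum(exit_prob.values())
    p_circle = {i: exit_prob[i] + inside[i] for i in inside}

    length = 0.0
    for i in inside:
        q_i = exit_prob[i]
        if q_i > 0:
            length -= q_i * log2(q_i / q_total)      # first level codebook
            length -= q_i * log2(q_i / p_circle[i])  # exit code of community i
    for (i, node), c in count.items():
        p = c / two_m
        length -= p * log2(p / p_circle[i])          # second level codes
    return length


# Hypothetical example: two triangles of links sharing node "c".
two_triangles = {
    ("a", "b"): 0, ("b", "c"): 0, ("a", "c"): 0,
    ("c", "d"): 1, ("d", "e"): 1, ("c", "e"): 1,
}
split = link_map_equation(two_triangles)
merged = link_map_equation({link: 0 for link in two_triangles})
```

On a network this small the single-community description is still shorter; the cost of the extra codes pays off only when the communities are large and well separated, which is the trade-off the minimization explores.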

This map equation for link community can now be used 
as the quality function to find link communities, just like 
other quality functions of community detection. Thus, 
most of the algorithms developed for other quality functions 
can also be modified to minimize $L_{\mathrm{linkcom}}(\mathsf{M})$ in 
Eq. (7). In this paper, we used a modified version of 
the algorithm developed by Rosvall and Bergstrom [20], 
which is an extended version of the Louvain method [21]. 
The difference between our optimization algorithm and the 
Louvain method is that the links, instead of the nodes, 
are locally grouped together to find the minimum 
efficiently. 

This method can be easily generalized to weighted networks, 
in which a weight is assigned to each link, and to directed 
networks, in which a direction is assigned to each 
link. In weighted networks, the LinkRank $r_{\alpha\beta}$ is 
no longer a constant; it is proportional to the 
weight $w_{\alpha\beta}$ of each link. The remaining steps are 
the same. In directed networks, the LinkRank 
$r_{\alpha\beta}$ is a quantity related to the global structure of the network, 
and it can be computed by following the procedure 
described in Ref. [16]. If the directed network is composed 
of only one strongly connected component (SCC), 
in which a directed path always exists between any two 
nodes, the equations in this section can 
still be used directly. It is important to notice that the 
order of $\alpha$ and $\beta$ in Eqs. (5) and (6) is different. In 
directed networks composed of more than one SCC, 
the situation becomes complicated because there can 
be more than one stationary value for the LinkRank. Therefore, 
random hopping should be included in the random 
walk, which is the same as adding all-to-all links 
of small weight to the network, to ensure the existence 
of only one stationary value for the LinkRank. The 
original network thus becomes an all-to-all connected network, 
in which a link exists between any pair of nodes. This 
makes the minimization of the map equation computationally 
expensive, because the number of links to 
be partitioned grows significantly. One possible solution 
is to consider the random hopping links only when 
computing the LinkRank values and then to normalize the 
LinkRank after removing the links generated by random 
hopping, as previously shown in Ref. [22]. 
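
One possible reading of this last step is sketched below. It is an assumption-laden illustration, not the procedure of Ref. [22]: node visit probabilities are obtained by power iteration of a random walk with uniform random hopping (a PageRank-style walk), the flow on each real link is taken as the visit probability of its source times the fraction of the source's outgoing weight carried by the link, and the flows are then renormalized over the real links only. The function name `linkrank`, the damping value 0.85, the iteration count, and the uniform treatment of nodes without outgoing links are all choices made for this example.

```python
def linkrank(weights, damping=0.85, iterations=200):
    """Normalized flow on the real links of a directed, weighted network.

    weights: dict {(source, target): positive weight}
    Returns {(source, target): r}, with the values summing to one.
    """
    nodes = list({node for link in weights for node in link})
    n = len(nodes)
    out_strength = {v: 0.0 for v in nodes}
    for (source, _), w in weights.items():
        out_strength[source] += w

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Nodes without outgoing links hop to a uniformly chosen node.
        dangling = sum(rank[v] for v in nodes if out_strength[v] == 0)
        base = (1.0 - damping + damping * dangling) / n
        new_rank = {v: base for v in nodes}
        for (source, target), w in weights.items():
            new_rank[target] += damping * rank[source] * w / out_strength[source]
        rank = new_rank

    # Keep only the flow on real links and renormalize it.
    flow = {(s, t): rank[s] * w / out_strength[s]
            for (s, t), w in weights.items()}
    total = sum(flow.values())
    return {link: f / total for link, f in flow.items()}


# Hypothetical examples: a directed 3-cycle and a network with a dead end.
cycle = linkrank({("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "a"): 1.0})
dead_end = linkrank({("a", "b"): 2.0, ("a", "c"): 1.0, ("b", "a"): 1.0})
```

For an undirected weighted network no iteration is needed, since the text above states that the LinkRank is simply proportional to the link weight.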

During the submission of this paper, another extension 
of the map equation for overlapping communities was 
proposed by Esquivel and Rosvall [13]. It would be 
interesting to compare the performance of these two 
methods. 

FIG. 2: (Color online) The communities detected in the 
karate club network. The colors of the nodes indicate the three 
node communities detected by the original map equation 
method, which minimizes $L_{\mathrm{nodecom}}$. The colors of the 
links indicate the four link communities detected by our method, 
which minimizes $L_{\mathrm{linkcom}}$. $L_{\mathrm{linkcom}} \approx 4.28$ bits and 
$L_{\mathrm{nodecom}} \approx 4.31$ bits for the community results illustrated 
in the figure. 



IV. A REAL-WORLD NETWORK ANALYSIS: 
THE KARATE CLUB NETWORK 

We applied our method to the famous karate club network [23], 
a social network analyzed in most community 
detection studies, and the result is illustrated in Figure 2. 
The color of the links indicates the link communities 
detected by minimizing the map equation for link 
community, $L_{\mathrm{linkcom}}$, while the color of the nodes indicates 
the node communities detected by minimizing the 
map equation for node community, $L_{\mathrm{nodecom}}$. According 
to the result of node community, some nodes, especially 
nodes Nos. 1 and 2, are categorized in the red (center) 
community, while a large portion of their neighbors belong 
to another community. In this karate club society, 
these nodes should be the members who connect different 
groups of people together, and their existence would be 
very important to the integration of the whole society. However, 
the multiple social roles of these nodes are not captured 
in the node community scheme because the nodes 
are forced to belong to a single community. Meanwhile, 
the result of the link community gives a much more intuitive 
interpretation of the organization structure. For 
example, the links connected to node No. 2 are divided 
into two communities, the blue (lower left) community 
and the red (center) community; the red links 
connect to other red nodes, while most of the blue links 
connect to blue nodes. The links connected to node 
No. 1 or to other nodes located at the boundary 
of communities show similar behavior. The link community 
scheme properly describes the multiple roles of the 
overlapping nodes, and it gives a more intuitive organization 
structure than the node community scheme, at least 
in this example. 

Meanwhile, it is important to notice that the link community 
approach is not a perfect solution to the detection 
of overlapping communities. For example, nodes 
Nos. 9 and 13 should belong to both the red community 
and the blue community at the same time according to the 
result of link communities, but the connection between 
those two nodes is assigned only to the blue community. 
This result may not represent the relation between 
those two members properly, because the interaction between 
them is very likely related to 
both the red and blue communities rather than being limited to 
only one community as the link community result suggests. 
Thus, an exclusive partitioning of the links may not 
represent the community structure of a network well when 
communities of links overlap heavily. However, the link 
community approach is a reasonable approximation that 
is quite effective in practical applications. First, 
its computational complexity is at the same level as that of the 
node community approach, while most other methods of 
detecting overlapping communities [11], [12], [El require 
much more complex algorithms. Furthermore, the hard 
partitioning of links may not be an important issue if one 
is interested only in identifying the overlapping roles of 
the nodes, because the degree of a node is usually larger 
than the number of overlapping communities the node 
belongs to. For example, although the link between nodes 
Nos. 9 and 13 is exclusively assigned to the blue community, 
this result does not affect the detection of the 
overlapping roles of nodes Nos. 9 and 13. 



V. COMMUNITY RESULTS COMPARED WITH 
METADATA 

The qualitative explanation of the community detection 
results, although interesting, has its limits in verifying 
the validity of the methods. A more solid approach 
is to compare the community results with 
the metadata contained in the system, like the analysis 
in Ref. [7]. We analyzed four networks with rich metadata, 
which are listed in Table I. The first is a sampled 
citation network of APS journal articles, which is constructed 
from the APS Data Sets for Research [23]. The 
sampled articles are the first- and second-level neighbors 
of a review paper [2] on complex networks. The metadata 
used to compare the results are the PACS (Physics and 
Astronomy Classification Scheme) numbers annotated to 
each article. Since the authors carefully choose the PACS 
numbers to make their articles well publicized, it is reasonable 
to consider the PACS numbers as rich and trustworthy 
metadata. The other three networks were previously 
constructed and analyzed in Ref. [7]. The metabolic network 
is constructed from the E. coli K-12 MG1655 strain, 
and the metadata used are the pathway annotations from 
the KEGG database [53]. The philosopher network is a 
network of Wikipedia pages for philosophers, with each 
link representing a hyperlink between the articles, and the 
metadata are the categories that each page belongs to. 
The last network analyzed is the word association network, 
which is constructed from the datasets about free 
association of word pairs, and the metadata are the 
meanings or definitions assigned to each word in the WordNet 
database [27]. 

TABLE I: The real-world networks with metadata. $N$ is 
the number of nodes, $M$ is the number of links, $C_{\mathrm{metadata}}$ 
is the number of categories in the metadata, $C_{\mathrm{linkcom}}$ is the 
number of communities detected by our method, and $C_{\mathrm{LC}}$ is 
the number of communities detected by the link clustering 
method [7]. 

Network            N      M       C_metadata   C_linkcom   C_LC
APS sample         4755   29669   1076         339         14891
Metabolic [7]      1042   17512   169          156         2304
Philosopher [7]    1219   5972    5417         152         2777
Word Assoc. [7]    5018   55232   13141        765         36654

In these networks, each node is annotated with one 
or more metadata categories, and the metadata can be considered 
as overlapping communities because they are 
closely related to the grouping of nodes. Also, the result 
of our method, in which the communities of links are 
detected, can be considered as overlapping communities 
of nodes. Thus, comparing the result of our 
method with the pre-assigned metadata can be considered 
as comparing two different sets of overlapping 
communities. Although several criteria have been proposed 
for comparing overlapping communities, none of 
them is as conclusive as the variation of information 
(VI) [28], which is a well-defined and widely accepted 
criterion for comparing two non-overlapping community 
partitions. In order to overcome the limitations of any individual 
criterion for overlapping communities, we compare 
the metadata with the link community results by two 
fundamentally different criteria, the extended normalized 
mutual information (NMI) and the extended 
Jaccard index [30], so as to observe the results from 
different aspects. Another extension of the mutual information 
for comparison of overlapping communities can 
be found in Ref. [Uj. Although this method is a better 
approach than the extended NMI we used, it is 
not used in this work because in some of our examples 
one metadata category may fully contain another, and 
the method cannot be used in such cases. 

FIG. 3: The performance test of various community-detecting 
methods. The communities detected by each 
method are compared with the metadata, and the performance 
is measured by the extended Jaccard index and the extended 
NMI. Linkcom represents the result of our method, 
LC represents the link clustering method [7], GCE represents 
the greedy clique expansion method [35], CPM represents the 
clique percolation method, Louvain represents the fast unfolding 
method in Ref. [21], Nodecom represents the original 
map equation method [6], LP represents the label propagation 
method [33], and CNM represents the Clauset-Newman-Moore 
method [34]. The first four methods are able to detect 
overlapping communities, and the last four methods are 
not. 

The extended NMI is an information-theory-based 
measure and is defined as 

$$N(X|Y) = 1 - \frac{1}{2} \left[ H(X|Y) + H(Y|X) \right], \tag{10}$$

where $X$ and $Y$ are two different partitions of overlapping 
communities and $H(X|Y)$ is the conditional entropy 
that measures the amount of information needed to 
infer $X$ given the partition $Y$. The extended NMI ranges 
from 0 to 1, and it equals 1 only when the two partitions 
$X$ and $Y$ are identical. Meanwhile, the extended Jaccard 
coefficient falls into the category of external indexes 
that measure the similarity of two partitions statistically. 
This index is defined as 

$$J(X, Y) = \frac{a_{G}}{a_{G} + d_{G}}, \tag{11}$$

where $a_{G}$ and $d_{G}$ measure the agreement and disagreement 
of partitions $X$ and $Y$, respectively. The index satisfies 
$J(X, Y) \in [0, 1]$, reaching 1 only when $X$ and $Y$ are 
identical, and it reduces to the original Jaccard index in 
Ref. [31] when $X$ and $Y$ are non-overlapping partitions. 
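
For the non-overlapping special case mentioned at the end of this paragraph, the Jaccard index can be sketched with plain pair counting. The sketch assumes the standard pair-counting form, in which a pair of nodes counts as an agreement when both partitions place the two nodes together and as a disagreement when only one of them does; it does not implement the extended index for overlapping communities. The function name `jaccard_index` and the label lists are hypothetical.

```python
from itertools import combinations


def jaccard_index(labels_x, labels_y):
    """Pair-counting Jaccard index between two hard partitions.

    labels_x, labels_y: sequences giving the community label of each
    node, in the same node order.
    """
    together_in_both = only_in_x = only_in_y = 0
    for a, b in combinations(range(len(labels_x)), 2):
        same_x = labels_x[a] == labels_x[b]
        same_y = labels_y[a] == labels_y[b]
        if same_x and same_y:
            together_in_both += 1
        elif same_x:
            only_in_x += 1
        elif same_y:
            only_in_y += 1
    denominator = together_in_both + only_in_x + only_in_y
    # Convention chosen here: two all-singleton partitions count as identical.
    return together_in_both / denominator if denominator else 1.0


identical = jaccard_index([0, 0, 1, 1], [5, 5, 7, 7])
crossed = jaccard_index([0, 0, 1, 1], [0, 1, 0, 1])
partial = jaccard_index([0, 0, 0, 1], [0, 0, 1, 1])
```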

We applied our method to the four networks, and the 
detected communities are compared with the metadata 
by the extended Jaccard index and the extended NMI. 
The result is presented in Figure 3, and the results of 
other community detection methods are presented 
together for comparison. The first four methods, 
which are able to detect overlapping communities, show 
much better performance than the last four methods, 
which can detect only hard-partitioned communities. 
This result indicates the importance of detecting 
overlapping communities in recovering the properties 
of individual nodes. The first two methods, our method 
and the link clustering method [7], which detect overlapping 
communities of nodes by detecting link communities, 
show significantly better performance than the other 
methods, with both the extended Jaccard index and the 
extended NMI taking meaningfully large values across the 
four networks analyzed. This indicates that 
overlapping communities of nodes can be efficiently detected 
by finding the link communities. 

It is important to notice that our method and the link 
clustering method both detect link communities, but 
they detect the communities at different hierarchical scales. 
As listed in Table I, the number of communities detected 
by the link clustering method is much larger than that 
detected by our method, indicating that our method detects 
communities of relatively larger size while the link clustering 
method detects communities of relatively smaller size. This 
scale factor should be considered when deciding 
which method to use to analyze the community 
structure of a network. The difference seems to originate 
from the different optimization goals of the two methods, 
but its fundamental cause remains 
unknown at this time. 



VI. COMPARISON OF LINK COMMUNITY 
AND NODE COMMUNITY 




FIG. 4: (Color online) The model network proposed to verify the
significance of overlap O. This network is a variation of the
Erdős-Rényi random network, and two communities, the red (left) and
the blue (right), are embedded in the network. There are a total of
2N nodes in the network, and 2n of them (green or middle-gray nodes)
are overlapping nodes while the other nodes are non-overlapping
nodes, with N - n nodes exclusively assigned to each community. The
probability of connecting nodes in the same community is p_in, and
it is much larger than p_out, which is the probability of connecting
nodes from different communities.

FIG. 5: (Color online) The minimum average description length
detected by our method and the original map equation method in the
model network. The red (filled) squares represent the minimum of
L_linkcom given by our method, and the black (empty) squares
represent the minimum of L_nodecom given by the original map
equation method, for different values of overlap strength n. The
inset shows the value of the significance of overlap O.



TABLE II: The significance of overlap O measured in some real-world networks. The networks are listed in descending
order of O.

Network                                   N       M       <k>    L_linkcom   L_nodecom   O
High-energy theory collaborations [36]    8361    15751   3.8    5.89        6.56         0.0539
Network science collaborations [3T]       1589    2742    3.5    3.48        3.77         0.0397
Power grid [38]                           4941    6594    2.7    5.19        5.60         0.0380
Amazon.com co-purchase [7]                18142   46166   5.1    5.84        6.11         0.0224
Political blogs                           1490    19090   25.6   8.65        8.93         0.0163
Word association [7]                      5018    55232   22.0   11.00       11.18        0.0080
APS journal citations (sampled) [24]      4755    29669   12.5   8.82        8.96         0.0076
Protein-protein interaction [7]           2729    12174   8.9    6.70        6.79         0.0068
Word adjacencies                          112     425     7.6    6.27        6.35         0.0068
Les Misérables [40]                       77      254     6.6    4.64        4.68         0.0043
Political books [41]                      105     441     8.4    5.44        5.48         0.0037
Zachary's karate club [23]                34      78      4.6    4.28        4.31         0.0035
Dolphin social network [42]               62      159     5.1    4.83        4.85         0.0024
Philosopher [7]                           1219    5972    9.8    8.43        8.46         0.0018
Jazz musicians collaborations [43]        198     5484    55.4   6.91        6.91         0.0002
C. elegans neural [38]                    297     2359    15.9   7.52        7.46        -0.0041
E. coli metabolic [7]                     1042    17512   33.6   8.33        8.25        -0.0053
American college football [44]            115     616     10.7   5.66        5.44        -0.0199

It is interesting to notice that in the result for the karate club
network, illustrated in Figure 2, the map equation for the link
community, L_linkcom, is smaller than the map equation for the node
community, L_nodecom. Recall that the map equation measures the
amount of unknown information about the structure of the network
that remains once the community structure is known; for each of
L_linkcom and L_nodecom, a smaller value of the map equation
indicates that the assumed community structure is a better
description of the organization structure of the network. This
reasoning can be extended to the comparison between L_linkcom and
L_nodecom. In both methods, the rule for the random walk is the
same, the second-level codes are all assigned to the nodes, and the
description length is measured in the same unit. Therefore, the only
possible cause of the difference between L_linkcom and L_nodecom is
the different rules for the first-level codes, and this difference
can be used to test which encoding rule is better, the link
community or the node community. For example, if the minimum value
of L_linkcom is smaller than the minimum of L_nodecom, one can
conclude that the link community scheme is better than the node
community scheme at representing the organization structure of the
network, because the link community scheme extracts more information
about the structure and leaves less unknown information in the path
description. If instead L_nodecom is smaller, there is not much
overlap of communities over the nodes, and the non-overlapping
methods are good enough to study the community structure of the
network.
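
To make the node-community side of this comparison concrete, the sketch below evaluates the description length of a given node partition. It assumes the standard two-level map equation of the original method on an undirected, unweighted network without self-loops, where the stationary visit rate of a node is its degree divided by 2M. The function names and the input format are illustrative, and the link-community variant L_linkcom defined earlier in the paper is not reproduced here.

```python
from collections import defaultdict
from math import log2


def plogp(p):
    """Return p * log2(p), with the convention 0 * log2(0) = 0."""
    return p * log2(p) if p > 0 else 0.0


def node_map_equation(edges, module_of):
    """Two-level map equation (bits per step) for a node partition.

    edges: iterable of undirected, unweighted (u, v) pairs without self-loops.
    module_of: dict mapping each node to its module label.
    """
    degree = defaultdict(int)
    cut = defaultdict(int)       # edge ends that leave each module
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if module_of[u] != module_of[v]:
            cut[module_of[u]] += 1
            cut[module_of[v]] += 1
    two_m = sum(degree.values())

    # Stationary visit rate of a node is degree / 2M on an undirected graph.
    visit = {node: k / two_m for node, k in degree.items()}
    within = defaultdict(float)  # total visit rate inside each module
    for node, p in visit.items():
        within[module_of[node]] += p
    exit_rate = {m: cut[m] / two_m for m in within}

    total_exit = sum(exit_rate.values())
    return (plogp(total_exit)
            - 2 * sum(plogp(q) for q in exit_rate.values())
            - sum(plogp(p) for p in visit.values())
            + sum(plogp(exit_rate[m] + within[m]) for m in within))


# Two triangles joined by one bridge: splitting them shortens the description.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
one_module = node_map_equation(edges, {n: 0 for n in range(6)})
two_modules = node_map_equation(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
print(one_module > two_modules)  # prints True
```

With a single module the expression reduces to the entropy of the node visit rates, which is a convenient sanity check.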

To quantitatively analyze the difference between L_nodecom and
L_linkcom, we propose a quantity called the significance of overlap:

    O = \frac{L_{\mathrm{nodecom}} - L_{\mathrm{linkcom}}}{L_{\mathrm{nodecom}} + L_{\mathrm{linkcom}}}.    (12)




This quantity measures how much better the link community scheme is
than the node community scheme, and it can also be used to measure
the overlapping strength of communities. The significance of overlap
satisfies O ∈ (-1, 1); it is positive when the link community scheme
is better and negative otherwise.
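
As a small worked example, the sketch below evaluates the significance of overlap from two description lengths. It assumes that Eq. (12) reads O = (L_nodecom - L_linkcom) / (L_nodecom + L_linkcom), a form that reproduces the values listed in Table II to within rounding. The function name `significance_of_overlap` is illustrative.

```python
def significance_of_overlap(l_nodecom, l_linkcom):
    """Significance of overlap O, assuming the form of Eq. (12) stated above.

    Both arguments are minimum description lengths in the same unit
    (for example bits per step) and must be positive.
    """
    if l_nodecom <= 0 or l_linkcom <= 0:
        raise ValueError("description lengths must be positive")
    return (l_nodecom - l_linkcom) / (l_nodecom + l_linkcom)


# Rounded description lengths taken from two rows of Table II.
o_hep = significance_of_overlap(6.56, 5.89)       # high-energy theory collaborations
o_football = significance_of_overlap(5.44, 5.66)  # American college football
print(o_hep > 0, o_football < 0)  # prints True True
```

Because the table lists rounded description lengths, the recomputed values can differ from the tabulated O in the last digit.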

In order to check the validity of this quantity, we propose a model
network (Figure 4) generated as follows. The model network is based
on the Erdős-Rényi network [45], and two overlapping communities are
embedded in the network. Among the 2N nodes of the network, 2n nodes
are overlapping nodes, while N - n nodes are exclusively assigned to
each community. The probability of linking two nodes from the same
community is p_in, and the probability of linking two nodes from
different communities is p_out. The two communities overlap more
when n is larger and less when n is smaller. Therefore, n can be
considered as the parameter that controls the overlap strength of
the two communities. Figure 5 shows the results of L_nodecom,
L_linkcom, and the significance of overlap O for different values of
n, while the other parameters are fixed as N = 50, <k> = 10,
p_out/p_in = 15. The error bars indicate the standard deviation over
four hundred realizations of the network. When the overlap strength
n is small, L_nodecom is much smaller than L_linkcom, indicating
that the node community scheme is better, and the significance of
overlap O takes a negative value. As n grows, O increases and
becomes positive, which means the link community scheme is better at
describing the organization structure. When n gets even larger, the
overlap is too strong and the network is recognized as one community
by both methods; thus, the value of O falls to zero. This result
matches our prediction well; therefore, the significance of overlap
can be used as a quantitative measure of the strength of overlap.
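
The construction of the model network can be sketched as follows. This is an illustration under stated assumptions rather than the generator used for Figure 5: the function name `overlap_model`, the community labels "A" and "B", and the rule that a pair of nodes sharing at least one community is linked with probability p_in (otherwise p_out) are choices made here, and the parameter values in the example are arbitrary.

```python
import random
from itertools import combinations


def overlap_model(N, n, p_in, p_out, seed=None):
    """Generate one realization of the two-community overlap model.

    Nodes 0 .. N-n-1 belong only to community A, the next 2n nodes belong
    to both communities, and the last N-n nodes belong only to community B.
    Returns (membership, edges).
    """
    if not 0 <= n <= N:
        raise ValueError("n must satisfy 0 <= n <= N")
    rng = random.Random(seed)
    membership = {}
    for node in range(2 * N):
        if node < N - n:
            membership[node] = {"A"}
        elif node < N + n:
            membership[node] = {"A", "B"}
        else:
            membership[node] = {"B"}
    edges = []
    for u, v in combinations(range(2 * N), 2):
        # Assumption: a pair sharing at least one community uses p_in.
        p = p_in if membership[u] & membership[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return membership, edges


# Arbitrary illustrative parameters, not the values used in the paper.
membership, edges = overlap_model(N=50, n=10, p_in=0.2, p_out=0.01, seed=0)
overlapping = [node for node, comms in membership.items() if len(comms) == 2]
print(len(membership), len(overlapping))  # prints 100 20
```

Sweeping n from 0 to N while keeping the other parameters fixed gives the family of networks over which the overlap strength varies.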

We measured the significance of overlap O for some real-world
networks, and the results are listed in Table II. Although we do not
yet fully understand how to interpret the exact value of O, some
conclusions can still be drawn by comparing the values of O with the
result for the karate club network (Figure 2), in which the overlap
of communities is well observed. The significance of overlap in the
karate club network is 0.0035, and many networks show a larger value
of O than the karate club network. Many social networks, such as the
collaboration networks of scientists, the network of political
blogs, the social network in Les Misérables, and the dolphin social
network, show much stronger or similar degrees of community overlap
compared to the karate club network, in accordance with the
well-accepted knowledge that social communities tend to overlap with
each other. The biological networks such as the C. elegans neural
network and the metabolic network show negative values of O, which
means the communities in these networks do not overlap much, while
the protein-protein interaction network shows a positive value of O.
This result might be related to the different biological functions
of the communities in these biological networks, and further
investigation would be necessary. The college football network, in
which the teams are divided into regional leagues and most games are
played within the leagues, shows a non-overlapping community
structure, and this result strengthens the validity of the
significance of overlap. Finally, the fact that many networks have
positive values of O indicates that overlapping community structure
exists in many real-world networks, and that it is important to
study the organization structure of these networks by detecting the
overlapping communities instead of insisting on non-overlapping
communities.

VII. SUMMARY 

We proposed a method to detect link communities in networks by
modifying the map equation method, which detects communities by
minimizing the average description length of the random walk. In our
method, the communities are assigned to links instead of nodes, the
encoding rule for the random walk is modified to represent this
change in the community structure, and the corresponding map
equation for the link community is proposed. The map equation for
the link community can be used to detect link communities by finding
the link partitioning that gives the minimum value of the map
equation, just like other quality functions, and most of the
algorithms that were developed to maximize (or minimize) other
quality functions can be used after minor modifications.



One of the advantages of our method is that the overlapping
communities of nodes can be detected relatively easily, by defining
the communities of a node as the communities of the links that are
connected to it. We tested our method on some real-world networks by
comparing the detected communities with the metadata of the nodes,
and compared the results with those of other community detection
methods. The results show that the communities detected by our
method agree well with the metadata of the nodes, and that the link
community scheme is an efficient way to detect the overlapping
communities of nodes.
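
The step from link communities to overlapping node communities can be sketched in a few lines. The sketch assumes the link partition is given as a dictionary from links to community labels; the function names and this input format are illustrative and are not taken from the paper.

```python
from collections import defaultdict


def node_communities_from_links(link_community):
    """Map each node to the set of communities of its incident links.

    link_community: dict mapping an undirected link (u, v) to a community label.
    """
    memberships = defaultdict(set)
    for (u, v), community in link_community.items():
        memberships[u].add(community)
        memberships[v].add(community)
    return dict(memberships)


def overlapping_nodes(memberships):
    """Return the nodes that belong to more than one community."""
    return sorted(node for node, comms in memberships.items() if len(comms) > 1)


# Two triangles sharing node 2: the shared node inherits both link communities.
links = {(0, 1): "a", (0, 2): "a", (1, 2): "a",
         (2, 3): "b", (2, 4): "b", (3, 4): "b"}
memberships = node_communities_from_links(links)
print(overlapping_nodes(memberships))  # prints [2]
```

A node is reported as overlapping exactly when its incident links fall into more than one link community.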

Another important advantage of our method is that the node community
scheme and the link community scheme can be compared quantitatively.
Since the difference between the map equation for the link community
and the map equation for the node community comes only from the
difference in community structure (the communities being assigned to
the links or to the nodes), the difference can be used to test which
scheme, the link community or the node community, better represents
the organization structure of the network. We used a quantity named
the significance of overlap to measure this difference in map
equations, and the analysis of the significance of overlap in some
real-world networks shows that many real-world networks can be
better studied with link communities. Therefore, detecting the
overlapping communities is necessary to understand the organization
structure of the networks better, and finding link communities is an
efficient way to detect the overlapping communities of nodes.